ABSTRACT

We consider a collection of prediction experiments, which 
are clustered in the sense that groups of experiments ex- 
hibit similar relationship between the predictor and response 
variables. The experiment clusters as well as the regres- 
sion relationships are unknown. The regression relation- 
ships define the experiment clusters, and in general, the 
predictor and response variables may not exhibit any clus- 
tering. We call this prediction problem clustered regres- 
sion with unknown clusters (CRUC) and in this paper we 
focus on linear regression. We study and compare several 
methods for CRUC, demonstrate their applicability to the 
Yahoo Learning-to-rank Challenge (YLRC) dataset, and in- 
vestigate an associated mathematical model. CRUC is at 
the crossroads of many prior works and we study several 
prediction algorithms with diverse origins: an adaptation 
of the expectation-maximization algorithm, an approach in- 
spired by K-means clustering, the singular value threshold- 
ing approach to matrix rank minimization under quadratic 
constraints, an adaptation of the Curds and Whey method 
in multiple regression, and a local regression (LoR) scheme 
reminiscent of neighborhood methods in collaborative filter- 
ing. Based on empirical evaluation on the YLRC dataset 
as well as simulated data, we identify the LoR method as a 
good practical choice: it yields best or near-best prediction 
performance at a reasonable computational load, and it is 
less sensitive to the choice of the algorithm parameter. We 
also provide some analysis of the LoR method for an asso- 
ciated mathematical model, which sheds light on optimal 
parameter choice and prediction performance. 
1. INTRODUCTION

Regression, which estimates the relationship between re- 
sponse variables and predictor variables, has a rich history 
(see for example [16]). It is employed in a variety of
fields such as machine learning and signal processing. Often
the response variables and predictor variables are collected 
from different experiments, and while the data may be dif- 
ferent across experiments, there may be reason to believe 
that several experiments share the same regression relation- 
ship (such as the same regression parameter vector in case 
of linear regression). For example, in the Yahoo Learning- 
to-rank Challenge (YLRC) dataset [Tt], there are several 
queries, and for each query, there are multiple URLs with 
relevance scores. This can be thought of as a regression 
problem between the relevance scores as the response vari- 
ables and the feature vectors of the query-URL pair as the 
predictor variables. Since we expect many queries to be re- 
lated to each other, we can postulate clusters of queries, with 
common regression parameters within a cluster, and differ- 
ent parameters across clusters. In this paper, we consider 
such a regression problem where the experiment clusters are 
unknown. The term "clustered regression" has already been 
used to refer to the case when the clusters are known [12, 13].
Hence, we refer to our problem as clustered regression with
unknown clusters (CRUC). We study several approaches to
this problem arising from different perspectives. For our
study, we use the YLRC dataset^1 as well as a mathematical
model with corresponding analysis and simulations. In
the remainder of this section, we briefly describe the algo- 
rithms we propose, their relationship to prior literature, and 
we outline our main results. 

In Section 2, we describe our basic setup in detail. In brief,
we consider M experiments with associated predictor and 
response variables, and we focus on linear regression. For 
this prediction problem, two methods immediately come to 
mind: a common linear regression fit across all experiments 
and an individual linear regression for each experiment. If 
we expect that several but not all of the experiments share 
the same regression relationship, then we wish for methods 
that lie between these two extremes. In Section 3, we propose
several such approaches and below we briefly outline
them.

The EM Algorithm: We can postulate that there are only
$1 < K_0 \ll M$ distinct regression vectors (that is, we have
$K_0$ clusters of experiments). We can treat the cluster index
of each experiment as missing data, and then use the EM
algorithm [9] to iteratively compute an estimate of the cluster
indices and the regression vectors.

^1 While the focus of YLRC is on ranking URLs, our focus
is on an associated regression problem. Our primary aim is
to show that the CRUC framework is meaningful for a real-world
dataset such as the YLRC dataset, and we do not
present any performance study of the ranking aspect.

K-means (KM) algorithm: The classical K-means clustering
algorithm [4, Section 9.1] iteratively computes the
cluster indices and the cluster centroids. In our context, we 
replace the centroid computation with least-squares regres- 
sion estimation. 

Singular value thresholding (SVT): If we consider the 
$d \times M$ matrix of regression vectors, then for $K_0 < d$, it has
a small rank. Hence another formulation is to minimize the 
rank of the regression vector matrix, subject to the condi- 
tion that the corresponding mean-square error is small (a 
quadratic constraint). Such an optimization problem has 
been considered in [7] and we adapt their SVT algorithm to
our problem. 

Curds and Whey (CW): In multiple regression [6], the
same predictor variables are used to predict multiple response
variables. For $M = K_0$ and the same predictor variables
across the experiments, our model reduces to that of
multiple regression. In other words, our setup is a generalization
of multiple regression, and we can modify the CW
method from [6] to our case. (Strictly speaking, for complexity
reasons, we consider a "local" variant of this method
as described in the neighborhood methods below.)

Local Regression (LoR): Neighborhood methods have
proved their worth in collaborative filtering as a good scalable
approach [2], [1], [3]. Motivated by such methods, we
consider local regression for prediction, where the "local"
neighborhood of an experiment is identified from individual
estimates of the regression vectors.

In Section 4, we report the prediction performance of the
various algorithms on the YLRC dataset as well as simu- 
lated data. We compare their implementation complexities, 
runtimes, mean-square error (MSE) and classification er- 
ror (CE). In particular, we identify the LoR method as a 
good choice that yields near-best performance with accept- 
able runtime. The LoR method is also less sensitive to the 
choice of algorithm parameters. Motivated by this finding, 
we analyze the LoR method for a mathematical model with 
$K_0 < M$ clusters of experiments and Gaussian prediction errors.
Our analysis exploits known results on the asymptotic
normality of maximum likelihood signal parameter estimates
[14]. We find that the optimal neighborhood size of the LoR
method coincides with the true cluster size, and this gives 
insight into the query cluster sizes in the YLRC dataset. We 
also study performance of the LoR method for different noise 
levels. For the entire range of noise values, the LoR method 
improves over individual as well as collective regression. At 
small noise levels, the LoR method finds the correct neigh- 
borhood, and as expected, the MSE is smaller than that for 
individual regression by a factor equal to the cluster size. As 
the noise level increases, its performance degrades gracefully 
and approaches that of collective regression. The conclusion 
is given in Section 5, and in the Appendices we fill in several
details. 

Notation: Matrices and vectors are written in bold upper
and bold lower case letters respectively. All vectors are column
vectors and $[\cdot]^T$ denotes the transpose. By
$\mathcal{N}(\cdot \mid \mu, \sigma^2)$, we denote the Gaussian density with
mean $\mu$ and variance $\sigma^2$, i.e.,

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

2. BASIC SETUP 

Consider $M$ experiments and let $\mathbf{x}_{mn} \in \mathbb{R}^d$, $y_{mn} \in \mathbb{R}$,
$n = 1, \ldots, N_m$ be the predictor and response variables
respectively for experiment $m$, $1 \le m \le M$. In general, we
would like to consider regression of the form:

$$y_{mn} = g_m(\mathbf{x}_{mn}) + w_{mn},$$

where $w_{mn}$ is the prediction error. If we expect different
experiments to have similar regression functions $g_m(\cdot)$, then
by pooling their data together, we hope to be able to estimate
$g_m$ better, and hence obtain improved prediction. If
the response and predictor variables across experiments exhibit
clustering, then it is conceptually easy to identify the
experiment clusters. In this paper, our interest is in the
case where the regression functions exhibit clustering, even
though the response and predictor variables may not show
any clustering.^2 Our goal is to study mechanisms for pooling
data from different experiments to improve the estimation
of the regression function and we focus on the special case
of linear regression:

$$y_{mn} = \mathbf{h}_m^T \mathbf{x}_{mn} + w_{mn}. \qquad (1)$$

In Section 3, we suggest several methods to pool data from
across experiments to improve estimation of $\mathbf{h}_m$ and consequently
improve prediction performance.

Is the above viewpoint useful in any applications? Our
empirical results in Section 4 suggest that the YLRC dataset
benefits from this viewpoint. For this dataset, each exper- 
iment corresponds to a web search with a given query and 
trials of an experiment correspond to the different URLs 
in the search results. Each predictor variable is a feature 
vector of the corresponding query-URL pair, and the corre- 
sponding response variable is a score indicating relevance of 
that URL to the query. Relevance scores are in the range
from 0 (irrelevant) to 4 (perfectly relevant). We expect that
many queries are related to each other and hence pooling 
their data together may improve prediction. The challenge 
lies in the fact that, a priori, we do not know which queries
are similar, and the feature vectors/relevance scores do not 
exhibit clustering on their own. 

To gain further insight, we also test our methods on simulated
data. For simulating the data as well as the mathematical
analysis, the following model is useful. Suppose that the
regression vectors $\mathbf{h}_m$ are chosen uniformly at random from
a set of $K_0$ distinct vectors, and suppose that $\{w_{mn}\}$ are
independent Gaussian with mean 0 and variance $\sigma^2$. For
this model, we have $K_0$ unknown clusters of experiments.
Since we are interested in the extreme case where the data
across experiments is diverse and does not exhibit any obvious
clustering, we consider independent predictor variables
across experiments. In particular, we choose $\{\mathbf{x}_{mn}\}$ to be
i.i.d. Gaussian with zero mean and identity covariance matrix.
The $K_0$ true regression vectors are generated randomly,
uniformly on the unit sphere in $\mathbb{R}^d$.
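As an illustration, the simulation model just described can be sketched in Python with NumPy. The function name `simulate_cruc`, its argument names, and the default values are illustrative assumptions (the defaults for M, d, K0 and N mirror the values used in Section 4.3); the sketch also assumes the same number of trials N in every experiment.

```python
import numpy as np

def simulate_cruc(M=100, d=6, K0=5, N=70, sigma=0.1, seed=0):
    """Draw one realization of the Gaussian CRUC model (illustrative names)."""
    rng = np.random.default_rng(seed)
    # K0 true regression vectors, uniform on the unit sphere in R^d
    # (a normalized standard Gaussian vector is uniform on the sphere).
    H_true = rng.standard_normal((K0, d))
    H_true /= np.linalg.norm(H_true, axis=1, keepdims=True)
    # Each experiment picks one of the K0 vectors uniformly at random.
    labels = rng.integers(0, K0, size=M)
    # Predictors: i.i.d. zero-mean Gaussian with identity covariance.
    X = rng.standard_normal((M, N, d))
    # Responses: y_mn = h_m^T x_mn + w_mn with w_mn ~ N(0, sigma^2).
    noise = sigma * rng.standard_normal((M, N))
    y = np.einsum("mnd,md->mn", X, H_true[labels]) + noise
    return X, y, labels, H_true
```

The returned `labels` are the hidden cluster indices; the algorithms of Section 3 only see `X` and `y`.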

3. ALGORITHMS FOR CRUC 

In this section, we briefly discuss various approaches to 
pool data from different experiments. 

3.1 Individual Regressions (IR) 

For every experiment $m$, we fit a separate least-squares
regression on the training data. Let
$\mathbf{X}_m := [\mathbf{x}_{m1}, \ldots, \mathbf{x}_{mN_m}]^T$
denote the matrix of predictor variables for the $m$th experiment,
while the corresponding vector of response variables
is denoted by $\mathbf{y}_m := [y_{m1}, \ldots, y_{mN_m}]^T$. Then the least-squares
estimate for the regression vector is^3

$$\hat{\mathbf{h}}_m = (\mathbf{X}_m^T \mathbf{X}_m)^{-1} \mathbf{X}_m^T \mathbf{y}_m.$$

^2 However, we note that our algorithms work even if the data
itself exhibits clustering.



3.2 Collective Regression (CR) 

Instead of having a different linear model for each experiment,
we can fit a common linear regression vector for all the
experiments. Let $\mathbf{X} := [\mathbf{X}_1^T, \ldots, \mathbf{X}_M^T]^T$ and
$\mathbf{y} := [\mathbf{y}_1^T, \ldots, \mathbf{y}_M^T]^T$
denote the set of predictor variables and the response variables
for the whole data respectively. Then the least-squares
estimate is

$$\hat{\mathbf{h}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$$
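The two baselines, IR and CR, can be sketched as follows. This is a minimal sketch, assuming for simplicity that every experiment has the same number of trials N, so that the data fit in arrays of shape (M, N, d) and (M, N); the function names are illustrative. `numpy.linalg.lstsq` is used in place of the explicit inverse, which gives the same estimate when the matrices are well-conditioned.

```python
import numpy as np

def individual_regressions(X, y):
    """IR: one least-squares fit per experiment. X: (M, N, d), y: (M, N)."""
    return np.stack([np.linalg.lstsq(Xm, ym, rcond=None)[0]
                     for Xm, ym in zip(X, y)])

def collective_regression(X, y):
    """CR: a single least-squares fit on the data stacked over all experiments."""
    M, N, d = X.shape
    return np.linalg.lstsq(X.reshape(M * N, d), y.reshape(M * N), rcond=None)[0]
```

IR returns an (M, d) array with one estimate per experiment, while CR returns a single vector of length d shared by all experiments.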

3.3 An EM Algorithm (EM) 

In order to motivate this approach, recall the mathemat- 
ical model with Gaussian noise introduced towards the end 
of Section 2. Suppose we have $K < M$ distinct regression
vectors and let $p_k$ denote the probability that the $k$th regression
vector is chosen for an experiment. Since we do not
know the clusters, the cluster index of each experiment is not
known. Treating this as the missing data, we get a link to
the EM setup in [9]. Using this link, in Appendix A we
derive an EM algorithm that iteratively computes estimates
of the class probabilities $p_k$, the noise variance $\sigma^2$, and the true
regression vectors. The steps in this algorithm are described
in the following.



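Iterations of the EM algorithm:

- Require: an integer $K$, $1 < K < M$; a guess for the
number of clusters.

- Initialize: $\mathbf{h}_{k,(0)}$, $p_{k,(0)}$, $k = 1, 2, \ldots, K$ and $\sigma^2_{(0)}$; initial
guess for the $K$ regression vectors, their assignment probabilities
and the noise variance respectively.

- For the $t$-th iteration:

E step: For $m = 1, 2, \ldots, M$, $k = 1, 2, \ldots, K$,

$$\gamma_{mk,(t)} = \frac{p_{k,(t-1)} \prod_n \mathcal{N}(y_{mn} \mid \mathbf{h}_{k,(t-1)}^T \mathbf{x}_{mn}, \sigma^2_{(t-1)})}{\sum_{k'} p_{k',(t-1)} \prod_n \mathcal{N}(y_{mn} \mid \mathbf{h}_{k',(t-1)}^T \mathbf{x}_{mn}, \sigma^2_{(t-1)})}$$

M step: For $k = 1, 2, \ldots, K$, update the assignment probability
$p_{k,(t)}$ and the regression vector $\mathbf{h}_{k,(t)} = \mathbf{A}^{-1}\mathbf{b}$,
where $\mathbf{A}$ and $\mathbf{b}$ are as defined in Appendix A,
and update the noise variance $\sigma^2_{(t)}$.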
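The E step can be sketched numerically as follows. This is only a sketch under stated assumptions: all experiments have the same number of trials N, the function and argument names are illustrative, and the computation is carried out in the log domain (a standard implementation device, not something prescribed above) because the product of N Gaussian densities underflows easily.

```python
import numpy as np

def e_step(X, y, H, p, sigma2):
    """Responsibilities gamma[m, k].

    X: (M, N, d) predictors, y: (M, N) responses,
    H: (K, d) current regression vectors, p: (K,) current probabilities,
    sigma2: current noise variance.
    """
    N = y.shape[1]
    # residuals[m, k, n] = y_mn - h_k^T x_mn
    residuals = y[:, None, :] - np.einsum("mnd,kd->mkn", X, H)
    # log of p_k * prod_n N(y_mn | h_k^T x_mn, sigma2)
    log_w = (np.log(p)[None, :]
             - 0.5 * N * np.log(2.0 * np.pi * sigma2)
             - 0.5 * np.sum(residuals ** 2, axis=2) / sigma2)
    # Subtract the row maximum before exponentiating (log-sum-exp trick).
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Each row of the returned array sums to one and plays the role of the soft cluster assignment of one experiment.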



3.4 A K-means Algorithm (KM) 

The mathematical model considered for motivating the 
EM algorithm above also provides a basis for this algorithm, 
which is reminiscent of the classical K-means clustering algorithm
[4, Section 9.1]. We start with an initial list of
$K$ regression vectors. In each iteration of the algorithm,
for each experiment, we find the regression vector from the
list of $K$ regression vectors that leads to the least MSE.
This leads to a clustering of the experiments. We pool together
data from all the experiments in a cluster and find
the least-squares regression vector. This yields a new list of
$K$ regression vectors and the method continues.

^3 We assume throughout that $N_m > d$ and the data is well-conditioned
so that the desired matrix inverses exist.



Iterations of the K-means algorithm:

- Require: an integer $K$, $1 < K < M$; a guess for the
number of clusters.

- Initialize: $\mathbf{h}_1^{(0)}, \mathbf{h}_2^{(0)}, \ldots, \mathbf{h}_K^{(0)}$; initial guess for the $K$
regression vectors.

- For the $t$-th iteration:
For $m = 1, 2, \ldots, M$,

$$k_m = \arg\min_k \sum_n \left(y_{mn} - (\mathbf{h}_k^{(t-1)})^T \mathbf{x}_{mn}\right)^2$$

For $k = 1, 2, \ldots, K$, find the least-squares estimate $\mathbf{h}_k^{(t)}$ using
data from the experiments with $k_m = k$.
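The iterations above can be sketched as follows, assuming equal trial counts N per experiment. The function name, the fixed iteration count, and the handling of empty clusters (an empty cluster simply keeps its previous vector) are choices made for this sketch and are not specified in the description above.

```python
import numpy as np

def km_regression(X, y, H0, n_iter=20):
    """K-means style clustered regression.

    X: (M, N, d), y: (M, N), H0: (K, d) float array of initial vectors.
    Returns the final vectors H (K, d) and the last assignments (M,).
    """
    H = H0.copy()
    M, N, d = X.shape
    for _ in range(n_iter):
        # Assignment step: squared error of each experiment under each vector.
        residuals = y[:, None, :] - np.einsum("mnd,kd->mkn", X, H)
        assign = np.argmin(np.sum(residuals ** 2, axis=2), axis=1)
        # Update step: pooled least squares within each cluster.
        for k in range(H.shape[0]):
            members = assign == k
            if members.any():  # empty clusters keep their previous vector
                Xk = X[members].reshape(-1, d)
                yk = y[members].reshape(-1)
                H[k] = np.linalg.lstsq(Xk, yk, rcond=None)[0]
    return H, assign
```

As with classical K-means, the result depends on the initial list `H0`, so in practice several initializations may be worth trying.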



3.5 A Singular Value Thresholding Algorithm (SVT)

If we have $K_0 < d$ clusters as in our mathematical model,
then the matrix $\mathbf{H} := [\mathbf{h}_1, \ldots, \mathbf{h}_M]$ of the regression vectors
has low rank. Motivated by this, consider the following optimization
problem. For a given matrix $\mathbf{G}$, let $\mathrm{MSE}(\mathbf{G})$
denote the resulting MSE:

$$\mathrm{MSE}(\mathbf{G}) = \sum_m \sum_n (y_{mn} - \mathbf{g}_m^T \mathbf{x}_{mn})^2,$$

where $\mathbf{g}_m$ is the $m$th column of $\mathbf{G}$. The low-rank assumption
suggests that for an $\epsilon > 0$, we should solve

$$\text{minimize } \mathrm{rank}(\mathbf{G}), \text{ such that } \mathrm{MSE}(\mathbf{G}) < \epsilon. \qquad (2)$$

This is a rank minimization problem with a quadratic constraint,
and we use the singular value thresholding (SVT)
approach of [7, Section 3.3.2] to solve it. (In Appendix B
we provide more details on how (2) falls under the formulation
of [7].)
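For orientation, the basic building block of singular value thresholding methods is the shrinkage of the singular values of a matrix. The sketch below shows only this operator; it is not the full iteration of [7] for the quadratic constraint in (2), and the function name and the threshold argument `tau` are illustrative.

```python
import numpy as np

def singular_value_threshold(G, tau):
    """Soft-threshold the singular values of G by tau."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # singular values below tau become zero
    return (U * s_shrunk) @ Vt            # scale the columns of U, then recombine
```

Singular values smaller than `tau` are set to zero, so the output has rank at most that of the input, which is how such iterations promote low-rank solutions.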

3.6 The Curds and Whey Method (CW) 

To utilize correlation among response variables, [6] introduced
a method called Curds and Whey that finds an "optimal"
linear combination of the individual regression estimates.
We use a "local" version of this method here, which
we describe next. The method takes an integer parameter
$T < M$ and starts by forming individual least-squares regression
vectors. For any two experiments, we think of the
Euclidean distance between their estimates as the distance
between the experiments. For each experiment, we then
form a list of $T$ closest experiments (including the experiment
itself), and this set is denoted by $\mathcal{T}$. Consider now the
predictor vector $\mathbf{x}_{mn}$ and suppose the individual estimates
are $\hat{y}_{mn}^{(t)} = \hat{\mathbf{h}}_t^T \mathbf{x}_{mn}$, $t = 1, 2, \ldots, T$, where $\hat{\mathbf{h}}_t$ is the $t$-th
individual estimate from the list of closest experiments. To
obtain the final estimate, we consider a linear estimate of
the form

$$\hat{y}_{mn} = \sum_t \beta_t \hat{\mathbf{h}}_t^T \mathbf{x}_{mn}.$$

The parameters $\beta_t$ are chosen by minimizing the MSE for
each experiment $m$ separately.

We note that the CW method fits a least-squares regression
on the data in $\mathcal{T}$, restricted to the span of $\{\hat{\mathbf{h}}_t\}_{t \in \mathcal{T}}$.




                    MSE                  Classification error           Runtime (sec)
Method   SMALL   MEDIUM  LARGE     SMALL   MEDIUM  LARGE     SMALL    MEDIUM    LARGE
IR       .7312   .9494   1.7746    .0619   .1097   .1628     1.5322   17.581    145.2082
CR       .7351   .7865   .80       .0536   .0846   .0967     1.4213   16.9712   134.2348
EM       .6244   .6924   -NA-      .0530   .0840   -NA-      507.6    18880     -NA-
KM       .6227   .6965   -NA-      .0522   .0842   -NA-      88.84    421.2     -NA-
SVT      .5896   -NA-    -NA-      .0510   -NA-    -NA-      15.8     -NA-      -NA-
CW       .6762   .7270   0.7399    .0525   .0841   .0968     1.8566   24.609    841.717
LoR      .6229   .7012   .7291     .0520   .0836   .0982     1.6042   22.961    834.597

Table 1: Error and Runtime comparison of various methods on real data



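Suppose $\mathcal{D}_{\mathcal{T}}$ denotes the indices of all the trials restricted to
the experiments in $\mathcal{T}$. Then the CW estimate for the $m$th
regression vector is

$$\hat{\mathbf{h}}_m = \arg\min_{\mathbf{h} \in \mathrm{span}\{\hat{\mathbf{h}}_t\}_{t \in \mathcal{T}}} \sum_{i \in \mathcal{D}_{\mathcal{T}}} (y_i - \mathbf{h}^T \mathbf{x}_i)^2.$$

Suppose $\mathbf{H}_{\mathcal{T}}$ denotes the $d \times T$ matrix whose columns are the
individual estimates of the $T$ most similar experiments. Let
$\mathbf{X}_{m,\mathcal{T}}$ and $\mathbf{y}_{m,\mathcal{T}}$ denote the predictor variables and the corresponding
response variables, restricted to the experiments
in $\mathcal{T}$. For a predictor variable $\mathbf{x}_i \in \mathbf{X}_{m,\mathcal{T}}$, let $\mathbf{z}_i^T := \mathbf{x}_i^T \mathbf{H}_{\mathcal{T}}$,
and suppose $\mathbf{Z}_{m,\mathcal{T}}$ denotes the matrix whose rows consist
of these vectors $\mathbf{z}_i^T$. Then the least-squares estimate for the
$\boldsymbol{\beta}$-vector for experiment $m$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}_{m,\mathcal{T}}^T \mathbf{Z}_{m,\mathcal{T}})^{-1} \mathbf{Z}_{m,\mathcal{T}}^T \mathbf{y}_{m,\mathcal{T}},$$

and the corresponding estimate for the regression vector is

$$\hat{\mathbf{h}}_m = \sum_t \hat{\beta}_t \hat{\mathbf{h}}_t.$$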
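Given the neighborhood of an experiment, the CW estimate described above reduces to one least-squares fit in the transformed variables. The following is a minimal sketch with illustrative names; it takes the neighbors' individual estimates and the pooled neighborhood data as inputs, and uses `numpy.linalg.lstsq`, which coincides with the explicit inverse when the matrix of transformed predictors has full column rank.

```python
import numpy as np

def cw_estimate(H_nbrs, X_nbrs, y_nbrs):
    """Local Curds-and-Whey estimate for one experiment.

    H_nbrs: (d, T) individual estimates of the T closest experiments,
    X_nbrs: (n, d) predictors pooled over those experiments,
    y_nbrs: (n,) corresponding responses.
    """
    Z = X_nbrs @ H_nbrs                               # rows are z_i^T = x_i^T H
    beta = np.linalg.lstsq(Z, y_nbrs, rcond=None)[0]  # least-squares beta
    return H_nbrs @ beta                              # sum_t beta_t h_t
```

By construction the returned vector lies in the span of the columns of `H_nbrs`.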

3.7 Local Regression (LoR) 

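In this approach, similar to the CW method, for every
experiment, we first find the $T$ most similar experiments.
Let $\mathbf{X}_{m,\mathcal{T}}$ and $\mathbf{y}_{m,\mathcal{T}}$ denote the predictor vectors and the
response scores restricted to these $T$ nearest experiments.
Then the final estimate for the $m$th regression vector is

$$\hat{\mathbf{h}}_m = (\mathbf{X}_{m,\mathcal{T}}^T \mathbf{X}_{m,\mathcal{T}})^{-1} \mathbf{X}_{m,\mathcal{T}}^T \mathbf{y}_{m,\mathcal{T}}.$$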
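The LoR method can be sketched end to end as follows, again assuming equal trial counts N per experiment so that the data fit in arrays of shape (M, N, d) and (M, N). The function name is illustrative. Neighborhoods are formed from Euclidean distances between individual estimates, as in the CW method.

```python
import numpy as np

def local_regression(X, y, T):
    """LoR estimates for all experiments. X: (M, N, d), y: (M, N), 1 <= T <= M."""
    M, N, d = X.shape
    # Step 1: individual least-squares estimates, one per experiment.
    H_ind = np.stack([np.linalg.lstsq(Xm, ym, rcond=None)[0]
                      for Xm, ym in zip(X, y)])
    # Step 2: pairwise Euclidean distances between individual estimates.
    dist = np.linalg.norm(H_ind[:, None, :] - H_ind[None, :, :], axis=2)
    # Step 3: pooled regression over the T closest experiments.
    H_lor = np.empty((M, d))
    for m in range(M):
        nbrs = np.argsort(dist[m])[:T]   # includes m itself (distance zero)
        H_lor[m] = np.linalg.lstsq(X[nbrs].reshape(-1, d),
                                   y[nbrs].reshape(-1), rcond=None)[0]
    return H_lor
```

With T = 1 this reduces to IR, and with T = M every experiment receives the CR estimate, so T interpolates between the two extremes discussed in the Introduction.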



Remark 1. Let $\mathbf{H}_{\mathcal{T}}$ denote the $d \times T$ matrix of the $T$ individual
estimates closest to an experiment. When $\mathrm{span}(\mathbf{H}_{\mathcal{T}}) = \mathbb{R}^d$,
i.e., the individual estimates span the whole space, the
CW method is the same as the LoR method. Otherwise, the
CW method finds a set of estimates that are restricted to a
subspace (the linear span of the individual estimates).

Remark 2. While the EM and KM algorithms perform 
joint regression and clustering, the other methods do not per- 
form any explicit clustering. 

4. PERFORMANCE COMPARISON 

In this section, we compare the different methods on var- 
ious samples of the YLRC dataset as well as on data simu- 
lated as per the Gaussian model. Along with MSE and CE, 
we also discuss the algorithms in terms of their runtime and 
complexity. 

4.1 Evaluation on YLRC Dataset 

The feature vectors of the original YLRC dataset are 700-dimensional
and sparse. Motivated by compressed sensing
of sparse vectors using random projections [15, 8], for computational
tractability, we project the feature vectors to a
randomly chosen 20-dimensional space, that is, each feature
vector is replaced by a 20-dimensional sketch. We work
only on a subset of the queries available in the dataset. The 
LARGE dataset consists of 5000 queries, the MEDIUM dataset 
consists of 1000 queries, and the SMALL dataset consists of 
100 queries. 
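A random projection of this kind can be sketched as below. The choice of a Gaussian projection matrix, its scaling, and the function name are assumptions made for illustration; the text above does not specify which random projection was used.

```python
import numpy as np

def random_sketch(features, k=20, seed=0):
    """Project the rows of `features` (n, D) to k dimensions.

    Assumption: an i.i.d. Gaussian projection matrix, scaled by 1/sqrt(k)
    so that squared norms are preserved on average.
    """
    rng = np.random.default_rng(seed)
    D = features.shape[1]
    P = rng.standard_normal((D, k)) / np.sqrt(k)
    return features @ P
```

The same projection matrix (here, the same seed) has to be applied to training and test feature vectors so that the fitted regression vectors remain meaningful on the test data.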

We split each of these datasets into training (70%) and a 
test dataset (30%) randomly. Given the training data and 
the feature vectors of the test data, we want to predict the 
corresponding relevance scores. We compare the methods 
on the basis of their MSE and CE on the test datasets. To 
calculate the CE, we classify the scores as HIGH (3,4) or LOW 
(0,1,2), and count the fraction of times when a HIGH score 
is estimated as LOW or vice versa. We also compare all the 
methods based on the runtimes of their respective Matlab 
implementations on an Intel Xeon 4-core 2.67 GHz machine 
with 16GB of RAM. Some of the methods considered have 
an input parameter (e.g., number of neighbors in the LoR 
and CW methods; number of clusters in the EM and KM 
algorithms). For comparison, for each algorithm, we choose 
the value of the input parameter that yields the least MSE. 
The MSE and the CE are averaged over 10 realizations of 
the random splitting into training and test datasets. 
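The two error measures can be sketched as follows. The HIGH/LOW split of the true scores follows the definition above; how a real-valued prediction is mapped to HIGH or LOW is not stated, so the cut-off of 2.5 used here (halfway between the scores 2 and 3) is an assumption, as are the function names.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean-square error between true and predicted scores."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def classification_error(y_true, y_pred, threshold=2.5):
    """Fraction of scores whose HIGH/LOW label differs between truth and prediction.

    Assumption: a score counts as HIGH when it is at least `threshold`.
    """
    high_true = np.asarray(y_true) >= threshold
    high_pred = np.asarray(y_pred) >= threshold
    return float(np.mean(high_true != high_pred))
```

For example, with true scores (0, 1, 2, 3, 4) and predictions (0.2, 1.0, 2.7, 2.9, 3.1), only the third score changes its label, giving a CE of 0.2.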
Error comparison: Table 1 summarizes the performance
of all the algorithms. For the SMALL dataset, we see that the
SVT algorithm performs the best, while the LoR, EM and
KM methods are not too far behind. For the MEDIUM
dataset, however, we could not run the SVT algorithm in our setup
due to large memory requirements. For this dataset, CR
performs better than IR, and the EM algorithm
shows the best performance, while the LoR method
is quite close. In terms of running time, the EM and KM
algorithms are very slow compared
to the LoR and CW methods. In fact, for the LARGE dataset,
the EM and KM algorithms take too long and we could not
report their MSE. For this dataset, the MSE of IR is more than
twice that of CR (1.7746 versus 0.80), and the LoR method gives a
further improvement of roughly 9% over CR (0.7291). The CW method
also shows similar (slightly worse) performance. Hence we see that the LoR
method has a great advantage over global algorithms like
the EM, KM and SVT algorithms in terms of the runtime,
and also yields competitive performance in terms of MSE as
well as CE.

Predictor variables are not clustered: The CRUC framework
and the associated algorithms are most relevant when
the data does not exhibit clustering but the regression vectors
are clustered. (However, we note that all the above algorithms
work even when the data is clustered.) The YLRC
dataset appears to be an example of such a situation. To
demonstrate this, we consider an arbitrary query, and let $\mathbf{A}$
denote the matrix of feature vectors restricted to the 10 most
similar queries. The second largest eigenvalue of the row-normalized
correlation matrix of $\mathbf{A}$ is never more than 0.23,^4
which suggests that the predictor variables are not clustered.
But the regression vectors across queries show clustered behavior
and hence the various methods proposed show substantial
improvement over IR. Moreover, these methods also
outperform CR, which indicates that there are several query
clusters.

Figure 1: Sensitivity and running time comparison of the LoR and CW methods as number of neighbors T
varies for the MEDIUM dataset

Impact of algorithm parameters: The LoR and CW
methods take as input a parameter $T$ for the neighborhood
size, while the KM and EM algorithms need a parameter $K$
representing the number of clusters. We next study the sensitivity
of the MSE to the choice of these parameters.

In Figure 1(a), we compare the MSE of the LoR and CW
methods as we vary $T$. We see that when $T$ is large, the performance
of the CW method matches that of the LoR method,
which is consistent with Remark 1 in Section 3.7. From the
experiments on the simulated data in Section 4.3, we shall
see that in a high SNR regime, the optimal neighborhood
size is very close to the true cluster size. Figure 1(a) thus
suggests that the regression vectors are clustered, and the
effective cluster size for the MEDIUM dataset is close to 9.
Figure 1(b) shows that the running time of the LoR method
increases linearly with $T$, whereas the CW method shows
a super-linear growth. (In Section 4.2, we show that the
runtime of the CW method grows as $O(T^3)$.) Thus while both the
LoR and CW methods yield near-best MSE and CE, the
runtime of the LoR method is smaller.



Figure 2(a) shows that the performance of the KM algorithm
is less sensitive to its parameter than that of the
EM algorithm. However, both these methods are more sensitive
to under-estimation of the optimal parameter compared
with the LoR and CW methods. From Figure 2(b), we see
that the runtime of the KM algorithm scales better than that of the EM
algorithm, but both are worse compared with the LoR and
CW methods.

^4 From the spectral clustering literature [10], we know that the
number of eigenvalues close to unity is an estimate of the
number of clusters.

To understand the above runtimes, we further analyze the 
computational complexity of all the different methods in the 
next section. 

4.2 Complexity Comparison 

To avoid cluttered expressions, in this section we assume
that there are the same number of trials in each experiment, and
we denote this number by $N$. We also assume that the
dimension parameter $d$ is of constant order.

IR: To compute the $m$th estimate, we need $O(N)$ operations
to multiply a $d \times N$ matrix with its transpose; a constant
number of operations to invert the $d \times d$ matrix $\mathbf{X}_m^T \mathbf{X}_m$; $O(N)$
operations to multiply this $d \times d$ inverse with another $d \times N$
matrix $\mathbf{X}_m^T$; and $O(N)$ operations to multiply this resulting
$d \times N$ matrix by a vector $\mathbf{y}_m$. Thus we need $O(N)$ operations
for each of the estimates, requiring $O(MN)$ operations in
total.

CR: By a similar analysis as above, we see that we need
$O(MN)$ operations to do a complete regression.

EM: In the E-step of each iteration, we need $O(MNK^2)$
operations. In the M-step, we need $O(MNK)$ operations to
construct the matrix $\mathbf{A}$ and the vector $\mathbf{b}$, and a constant
number of operations to compute $\mathbf{A}^{-1}\mathbf{b}$. To compute the
variance, we need $O(MNK)$ operations. Thus for the EM
algorithm, we require $O(MNK^2)$ operations per iteration.

KM: At each iteration, starting with a list of $K$ regression
vectors, for each experiment, we need $O(NK)$ operations
to obtain the regression vector that best explains the data.
Thus we require $O(MNK)$ operations for clustering the experiments.
After grouping the experiments based on the
closest regression vector, suppose there are $m_k$ experiments
corresponding to the $k$th vector. To do the corresponding
regression, we need $O(m_k N)$ operations, resulting in a total
of $O(MN)$ operations for all the $K$ regressions. Thus
we need $O(MNK)$ operations for each iteration of the KM
algorithm.

SVT: In every iteration of this method, we need to multiply
an $Md \times MN$ matrix by an $MN \times 1$ vector, requiring $O(M^2 N)$
operations, and then to perform the singular value thresholding
on an $M \times d$ matrix, we need $O(M)$ operations. Thus
in each iteration of this algorithm, we need $O(M^2 N)$ operations.

Figure 2: Sensitivity and running time comparison of the EM and KM algorithms as the cluster size K varies
for the SMALL dataset

Method                 Complexity
IR                     O(MN)
CR                     O(MN)
EM (per iteration)     O(MNK^2)
KM (per iteration)     O(MNK)
SVT (per iteration)    O(M^2 N)
CW                     O(M^2 T + MNT^3)
LoR                    O(M^2 T + MNT)

Table 2: Complexity comparison of methods

CW: After we have performed all the individual regressions
for each experiment, we need to compute the $T$ most similar
experiments. To do this we need to compute similarities
with all the experiments, requiring $O(M)$ operations; and to
find the $T$ most similar experiments requires $O(TM)$ operations.
Then to compute the matrix $\mathbf{Z}_{m,\mathcal{T}}$ we need $O(NT^2)$
operations; and to compute the regression using this matrix
requires $O(T^3 N)$ operations. Thus to compute the estimates
for all the experiments, we need $O(M^2 T + MNT^3)$
operations, including the operations needed to compute the
individual estimates.

LoR: As in the CW method, we need $O(MT)$ operations
to find the $T$ most similar experiments. Then to perform a
regression on the data of these $T$ experiments requires $O(TN)$
operations, requiring a total of $O(MNT + M^2 T)$ operations
for all the $M$ estimates.

We summarize the complexity of these algorithms in
Table 2. We see that the LoR method has a linear growth
with $T$, unlike the CW method that has a cubic growth.
This is consistent with the empirical evaluation as seen in
Figure 1(b). Similarly, the KM algorithm shows a linear
growth with $K$ whereas the EM algorithm shows a quadratic
growth, which is consistent with the observation in Figure
2(b). Table 2 also shows that the EM and KM algorithms
have a linear growth with $M$, whereas the LoR and CW
methods have quadratic growth. This is unlike what we see
on real data, as shown in Table 1, where the runtimes of
the EM and KM algorithms scale much worse with $M$ compared
to the LoR and CW methods. This could be due to
bad constant factors for the EM and KM algorithms. IR
and CR have a linear growth with $M$, which is consistent with
the performance on real data shown in Table 1.

Based on the above discussions, we see that the LoR 
method is attractive from several viewpoints: it provides 
near best MSE and CE performance at reasonable compu- 
tational load and is also less sensitive to the choice of the 
input parameter than other methods. To understand it bet- 
ter, in the next section, we consider simulations and some 
mathematical analysis of the LoR method. 

4.3 Evaluation on Simulated Data 

Recall the mathematical model for the data with Gaussian
noise as introduced towards the end of Section 2. We use
$M = 100$ experiments, feature vector dimension $d = 6$,
$K_0 = 5$ clusters, and $N = 70$ trials per experiment.
We average the MSE over several realizations of the true 
regression vectors and the data. 

In Figure 3(a) we see how IR, CR, the LoR method and 
the EM algorithm perform as noise variance changes. When 
the noise level is low, the LoR method can find the right 
neighbors and hence collaboration helps. The LoR method 
shows better MSE than IR and CR, and for small noise lev- 
els, as expected its MSE is less than IR by a factor equal 
to the cluster size. As the noise level increases, it becomes 
harder to find the right neighbors, and the optimal LoR 
method picks all the neighbors to perform the regression and
performs as well as CR. This phenomenon is clearer in
Figure 3(b) which shows that when the noise level is low, 
the optimal value of T (the neighborhood size) is close to the 
actual cluster size, and as the noise level increases the opti- 
mal T is roughly same as the total number of experiments. 
In other words, for high noise levels, the best strategy is to 
filter out noise by pooling data from all the experiments. In 
between the two extremes of high and low noise, the LoR 
method provides a graceful transition by balancing the need 
to average out noise and the need to better estimate the 
regression vector. 

We also plot the performance of the EM algorithm as the
noise level changes. Since the EM iterations can converge to
a local maximum, we estimate the outage probability (the probability
that the iterations converge to a wrong optimum),
and compute its MSE only over the trials that converge to
something close to the true regression vectors. Figure 3(a)
shows that the EM algorithm and the LoR method have
very similar performance for all noise levels, while we observe
an outage probability of 9.7%, averaged over all noise
levels.

High SNR asymptote: We assume that in the high SNR
regime, the optimal LoR method picks all the right neighbors
to perform the local regression, and in this case, the estimate
corresponds to a maximum likelihood (ML) estimate. Hence
to find the high SNR asymptote, we can use standard results
about ML estimates. If there are $T$ neighbors, each with
$N$ data points, then we have a local regression with $TN$
data points for each experiment. Let $\hat{\mathbf{h}}_m$ be the estimated
regression vector for the $m$th experiment. We have the following
result.

Proposition 1. For each $m = 1, 2, \ldots, M$,

$$\sqrt{TN}\,(\hat{\mathbf{h}}_m - \mathbf{h}_m) \to \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}),$$

where $\mathbf{I}$ is the $d \times d$ identity matrix, $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a Gaussian
distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, and
the convergence is in distribution.

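Proof. Let $\Pr[y_{mn}]$ denote the likelihood of a trial. Then 
the (i, j)-th entry of the Fisher information matrix $\mathbf{J}$ (see 
[14, Section IV.E.1]) is defined as 

$$[\mathbf{J}]_{ij} = \mathbb{E}\left[\frac{\partial \log \Pr[y_{mn}]}{\partial h_{m,i}}\,\frac{\partial \log \Pr[y_{mn}]}{\partial h_{m,j}}\right] = \frac{1}{\sigma^4}\,\mathbb{E}\left[w_{mn}^2\, x_{mn,i}\, x_{mn,j}\right] = \frac{1}{\sigma^2}\,\delta_{ij},$$

where $\delta_{ij}$ equals 1 if i = j and 0 otherwise, and the last equality 
follows since $w_{mn} \sim \mathcal{N}(0, \sigma^2)$ and the $\mathbf{x}_{mn}$ are chosen 
as i.i.d. unit normal. Now standard asymptotic normality of maximum 
likelihood estimators for vector parameter estimation 
[14, Section IV.E.1] tells us that 

$$\sqrt{TN}\,(\hat{\mathbf{h}}_m - \mathbf{h}_m) \to \mathcal{N}(\mathbf{0}, \mathbf{J}^{-1}) = \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}).$$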

□ 

Proposition 1 suggests that for a large enough TN, $(\hat{\mathbf{h}}_m - \mathbf{h}_m)$ 
is almost normal with zero mean and covariance equal to 
$\frac{\sigma^2}{TN}\mathbf{I}$. Thus the mean MSE is approximately 

$$\operatorname{trace}\left(\frac{\sigma^2}{TN}\,\mathbf{I}\right) = \frac{d\,\sigma^2}{TN}.$$

In Figure 3(a) we compare this asymptotic MSE, using T = 
M/K_0, with the empirical performance on simulated data, 
and observe that the LoR method performs almost as well 
as the asymptote even at SNRs up to dB. 
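
As a rough numerical illustration of the asymptote, the following sketch compares $d\sigma^2/(TN)$ with a Monte Carlo estimate of $\|\hat{\mathbf{h}}_m - \mathbf{h}_m\|^2$ for ordinary least squares on TN pooled data points. It assumes the idealized case in which all T neighbors share exactly the same regression vector, with unit normal predictors and Gaussian noise. The function names (`asymptotic_mse`, `empirical_mse`, `solve_linear`) and the parameter values are illustrative and do not come from the paper.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def asymptotic_mse(d, sigma, T, N):
    """High-SNR asymptote: trace(sigma^2 / (T N) * I) = d sigma^2 / (T N)."""
    return d * sigma ** 2 / (T * N)

def empirical_mse(d, sigma, T, N, trials=300, seed=0):
    """Average ||h_hat - h||^2 for least squares on T*N pooled points.

    Assumption: every pooled point comes from the same regression vector h.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = [rng.gauss(0, 1) for _ in range(d)]
        XtX = [[0.0] * d for _ in range(d)]
        Xty = [0.0] * d
        for _ in range(T * N):
            x = [rng.gauss(0, 1) for _ in range(d)]
            y = sum(hi * xi for hi, xi in zip(h, x)) + rng.gauss(0, sigma)
            for i in range(d):
                Xty[i] += x[i] * y
                for j in range(d):
                    XtX[i][j] += x[i] * x[j]
        h_hat = solve_linear(XtX, Xty)  # normal equations
        total += sum((a - t) ** 2 for a, t in zip(h_hat, h))
    return total / trials
```

For moderate TN the two quantities should agree closely, since the finite-sample correction to the asymptote shrinks as TN grows.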

5. CONCLUSION 

We considered the CRUC framework, where experiments 
within a cluster show a similar relationship between the 
predictor variables and the response variables. We introduced 
and studied various methods and, based on experiments 
on the Yahoo Learning-to-rank Challenge dataset and on 
simulated data, observed that the local regression (LoR) 
method is a good practical choice. It shows near-optimal 
performance, scales reasonably well with data size, and is 
relatively insensitive to the choice of the algorithm parameter. 
We also analyzed the LoR method for an associated mathematical 
model, which helps us understand its prediction performance 
and optimal parameter choice. We have only scratched the 
surface, however, and we hope to present a deeper analysis 
of LoR in a future manuscript. A detailed study of the 
more general non-parametric CRUC case also appears to be 
a fruitful direction for further investigation. 

APPENDIX 

A. DERIVATION OF THE EM ALGORITHM 

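Suppose $z_{mk} \in \{0, 1\}$ is an indicator of whether the k-th 
regression vector $\mathbf{g}_k$ is sampled for the m-th experiment. Then 
we have $\sum_k z_{mk} = 1$. 

A bit of notation before we begin: $\mathbf{Y} := (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M)^T$ 
and $\mathbf{G} := (\mathbf{g}_1, \ldots, \mathbf{g}_K)$. Similarly, $\mathbf{z}_m := (z_{m1}, \ldots, z_{mK})^T$ and 
$\mathbf{Z} := (\mathbf{z}_1, \ldots, \mathbf{z}_M)^T$. 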

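Let $p_k$ be the probability of picking the k-th regression 
vector. Then $\sum_{k=1}^{K} p_k = 1$, and suppose $\mathbf{p} := (p_1, \ldots, p_K)^T$ 
represents the vector of probabilities. We can write the 
likelihood as the following mixture: 

$$\Pr[y_{mn} \mid \mathbf{x}_{mn}, \mathbf{G}] = \sum_{k=1}^{K} p_k\, \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2),$$

and 

$$\Pr[\mathbf{Y} \mid \mathbf{X}, \mathbf{G}] = \prod_{m} \sum_{k=1}^{K} p_k \prod_{n} \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2).$$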

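The ML estimate maximizes the above, i.e., 

$$\mathbf{G}_{ML} = \arg\max_{\mathbf{G}, \sigma, \mathbf{p}} \Pr[\mathbf{Y} \mid \mathbf{X}, \mathbf{G}].$$

Because the likelihood is a mixture, it is not easy to maximize 
directly. Instead, we take an approach based on the EM 
algorithm to maximize the likelihood. In the EM-based 
algorithm, we need to find [4, Section 9.3], [s] 

$$\arg\max_{\mathbf{G}, \sigma, \mathbf{p}} \mathbb{E}_{\Pr[\mathbf{Z} \mid \mathbf{Y}, \mathbf{X}, \mathbf{G}]} \log \Pr[\mathbf{Y}, \mathbf{Z} \mid \mathbf{X}, \mathbf{G}].$$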

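In our case this can be computed. We see that $\Pr[z_{mk} = 1] = p_k$, 
since the k-th regression vector is picked with probability 
$p_k$, and thus 

$$\Pr[\mathbf{z}_m] = \prod_{k} p_k^{z_{mk}},$$

and $\Pr[y_{mn} \mid \mathbf{x}_{mn}, \mathbf{G}, z_{mk} = 1] = \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2)$. Thus 
we have 

$$\Pr[y_{mn} \mid \mathbf{x}_{mn}, \mathbf{G}, \mathbf{z}_m] = \prod_{k} \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2)^{z_{mk}}.$$

This implies that 

$$\Pr[\mathbf{Y} \mid \mathbf{X}, \mathbf{G}, \mathbf{Z}] = \prod_{m} \prod_{k} \left( \prod_{n} \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2) \right)^{z_{mk}}$$

and 

$$\Pr[\mathbf{Y}, \mathbf{Z} \mid \mathbf{X}, \mathbf{G}] = \prod_{m} \prod_{k} \left( p_k \prod_{n} \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2) \right)^{z_{mk}}. \qquad (3)$$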



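Suppose $\gamma_{mk}$ denotes the expected value of $z_{mk}$ w.r.t. the 
posterior distribution $\Pr[\mathbf{Z} \mid \mathbf{Y}, \mathbf{X}, \mathbf{G}]$. Then we have 

$$\gamma_{mk} = \mathbb{E}_{\Pr[\mathbf{Z} \mid \mathbf{Y}, \mathbf{X}, \mathbf{G}]}\, z_{mk} = \Pr[z_{mk} = 1 \mid \mathbf{Y}, \mathbf{X}, \mathbf{G}] = \frac{\Pr[\mathbf{Y}, z_{mk} = 1 \mid \mathbf{X}, \mathbf{G}]}{\Pr[\mathbf{Y} \mid \mathbf{X}, \mathbf{G}]} = \frac{p_k \prod_{n} \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2)}{\sum_{k'} p_{k'} \prod_{n} \mathcal{N}(y_{mn} \mid \mathbf{g}_{k'}^T \mathbf{x}_{mn}, \sigma^2)}.$$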
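
The posterior weight $\gamma_{mk}$ is a product of many Gaussian densities, which underflows easily in floating point. A minimal sketch of one common way to evaluate it works in the log domain and subtracts the largest log-weight before exponentiating. The data layout (`X[m][n]` a predictor vector, `Y[m][n]` a response, `G[k]` a regression vector) and the function names are assumptions made for this illustration, and every $p_k$ is assumed to be strictly positive.

```python
import math

def log_gauss(y, mean, sigma):
    """Log-density of N(y | mean, sigma^2)."""
    return (-math.log(math.sqrt(2 * math.pi) * sigma)
            - (y - mean) ** 2 / (2 * sigma ** 2))

def responsibilities(X, Y, G, p, sigma):
    """Return gamma[m][k], proportional to p_k * prod_n N(y_mn | g_k^T x_mn, sigma^2).

    Each row gamma[m] is normalised to sum to one over k.
    """
    gamma = []
    for Xm, ym in zip(X, Y):
        logw = []
        for gk, pk in zip(G, p):
            ll = math.log(pk)  # assumes pk > 0
            for x, y in zip(Xm, ym):
                mean = sum(a * b for a, b in zip(gk, x))
                ll += log_gauss(y, mean, sigma)
            logw.append(ll)
        top = max(logw)                       # shift for numerical stability
        w = [math.exp(v - top) for v in logw]
        s = sum(w)
        gamma.append([v / s for v in w])
    return gamma
```

For an experiment whose responses are generated by one of the candidate vectors with little noise, the corresponding entry of `gamma[m]` is close to one.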



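Using the above and (3), we see that 

$$S := \mathbb{E}_{\Pr[\mathbf{Z} \mid \mathbf{Y}, \mathbf{X}, \mathbf{G}]} \log \Pr[\mathbf{Y}, \mathbf{Z} \mid \mathbf{X}, \mathbf{G}]$$

$$= \sum_{m} \sum_{k} \gamma_{mk} \left( \log p_k + \sum_{n} \log \mathcal{N}(y_{mn} \mid \mathbf{g}_k^T \mathbf{x}_{mn}, \sigma^2) \right)$$

$$= \sum_{m} \sum_{k} \gamma_{mk} \log p_k - \sum_{m,k,n} \gamma_{mk} \left( \log\left(\sqrt{2\pi}\,\sigma\right) + \frac{(y_{mn} - \mathbf{g}_k^T \mathbf{x}_{mn})^2}{2\sigma^2} \right).$$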

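To maximize w.r.t. G, we find that 

$$\frac{\partial S}{\partial \mathbf{g}_k} = \frac{1}{\sigma^2} \sum_{m} \sum_{n} \gamma_{mk} (y_{mn} - \mathbf{g}_k^T \mathbf{x}_{mn})\, \mathbf{x}_{mn} = 0 \;\Longrightarrow\; \mathbf{g}_k = \left( \sum_{m} \sum_{n} \gamma_{mk}\, \mathbf{x}_{mn} \mathbf{x}_{mn}^T \right)^{-1} \sum_{m} \sum_{n} \gamma_{mk}\, y_{mn}\, \mathbf{x}_{mn}. \qquad (4)$$

And to maximize w.r.t. $\sigma$, we obtain 

$$\frac{\partial S}{\partial \sigma} = -\sum_{m,n,k} \gamma_{mk} \left( \frac{1}{\sigma} - \frac{(y_{mn} - \mathbf{g}_k^T \mathbf{x}_{mn})^2}{\sigma^3} \right) = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{MN} \sum_{m,n,k} \gamma_{mk} (y_{mn} - \mathbf{g}_k^T \mathbf{x}_{mn})^2, \qquad (5)$$

where we used $\sum_k \gamma_{mk} = 1$, so that $\sum_{m,n,k} \gamma_{mk} = MN$. 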



To maximize w.r.t. the probability vector p, we have the 
following Lagrangian to be maximized: 

$$\mathcal{L} := \sum_{m} \sum_{k} \gamma_{mk} \log p_k + \lambda \left( \sum_{k} p_k - 1 \right).$$

By setting its derivative w.r.t. $p_k$ to zero, we obtain 

$$\sum_{m} \gamma_{mk} / p_k + \lambda = 0, \quad k = 1, \ldots, K.$$

Using the above and the constraint that $\sum_k p_k = 1$, we 
obtain 

$$p_k = \frac{1}{M} \sum_{m} \gamma_{mk}, \quad k = 1, \ldots, K. \qquad (6)$$

Equations (4), (5) and (6) complete the derivation of the 
EM iterations. 
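
The iterations derived in this appendix can be sketched in a few lines of plain Python: an E-step that computes the posterior weights $\gamma_{mk}$, followed by the weighted least-squares update of each $\mathbf{g}_k$, the update of $\sigma$, and the update of $\mathbf{p}$. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the names `em_step`, `make_data`, `solve_linear` and `dot`, the data layout, and the synthetic-data generator are all hypothetical, and no safeguard is included for a cluster whose weights all vanish.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - dot(M[i][i + 1:n], x[i + 1:])) / M[i][i]
    return x

def em_step(X, Y, G, p, sigma):
    """One EM iteration; returns (G, p, sigma, log-likelihood at the inputs)."""
    M, K, d = len(X), len(G), len(G[0])
    gamma, loglik = [], 0.0
    for Xm, ym in zip(X, Y):                  # E-step: posterior weights
        logw = [math.log(pk) + sum(
                    -math.log(math.sqrt(2 * math.pi) * sigma)
                    - (y - dot(gk, x)) ** 2 / (2 * sigma ** 2)
                    for x, y in zip(Xm, ym))
                for gk, pk in zip(G, p)]
        top = max(logw)
        w = [math.exp(v - top) for v in logw]
        s = sum(w)
        gamma.append([v / s for v in w])
        loglik += top + math.log(s)
    new_G = []
    for k in range(K):                        # weighted least squares for g_k
        A = [[0.0] * d for _ in range(d)]
        b = [0.0] * d
        for m in range(M):
            for x, y in zip(X[m], Y[m]):
                for i in range(d):
                    b[i] += gamma[m][k] * y * x[i]
                    for j in range(d):
                        A[i][j] += gamma[m][k] * x[i] * x[j]
        new_G.append(solve_linear(A, b))
    count = sum(len(ym) for ym in Y)          # equals M*N for equal-size experiments
    sq = sum(gamma[m][k] * (y - dot(new_G[k], x)) ** 2
             for m in range(M) for k in range(K)
             for x, y in zip(X[m], Y[m]))
    new_sigma = math.sqrt(sq / count)         # noise level update
    new_p = [sum(gamma[m][k] for m in range(M)) / M for k in range(K)]
    return new_G, new_p, new_sigma, loglik

def make_data(true_G, labels, N, noise, seed=0):
    """Hypothetical generator: experiment m uses true_G[labels[m]]."""
    rng = random.Random(seed)
    d = len(true_G[0])
    X = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)] for _ in labels]
    Y = [[dot(true_G[k], x) + rng.gauss(0, noise) for x in Xm]
         for k, Xm in zip(labels, X)]
    return X, Y
```

Calling `em_step` repeatedly from an initial guess gives a sequence of log-likelihood values that should not decrease. As noted in the simulation section, the iterations may still settle at a local optimum, so in practice several initializations may be needed.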

B. MSE AS A QUADRATIC CONSTRAINT 
IN SVT ALGORITHM 

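The optimization problem to be solved is to minimize the rank 
of the matrix whose columns are the regression vectors, under 
an MSE constraint, i.e., 

minimize $\operatorname{rank}(\mathbf{G})$, such that $\mathrm{MSE}(\mathbf{G}) \leq \epsilon$. 

Let $\mathbf{b} := (y_{11}, \ldots, y_{MN})^T$ denote the MN × 1 vector of all 
the relevance scores. Let $\mathbf{A}$ be the block-diagonal 
matrix whose i-th diagonal block (i = 1, ..., M) is the matrix 
$\mathbf{X}_i := (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iN})^T \in \mathbb{R}^{N \times d}$. Also let $\mathbf{g} = \operatorname{vec}(\mathbf{G}) \in 
\mathbb{R}^{dM \times 1}$ denote the vector obtained by stacking the columns of 
$\mathbf{G}$ one after another. Define $\mathcal{A}(\mathbf{G}) := \mathbf{A}\mathbf{g}$, so that $\mathcal{A}(\cdot)$ is a linear 
map from $\mathbb{R}^{d \times M}$ to $\mathbb{R}^{MN}$. Then we have 

$$\mathrm{MSE}(\mathbf{G}) = \|\mathbf{b} - \mathcal{A}(\mathbf{G})\|,$$

where $\|\cdot\|$ denotes the Frobenius norm of a matrix. Thus the 
rank minimization problem can be written as 

minimize $\operatorname{rank}(\mathbf{G})$, such that $\|\mathbf{b} - \mathcal{A}(\mathbf{G})\| \leq \epsilon$, 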

which has the exact same formulation as the rank minimiza- 
tion problem considered in [t] Section 3.3.2]. 
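
To make the linear map concrete, the sketch below builds the block-diagonal matrix from per-experiment predictor matrices, stacks the columns of G, and evaluates the norm of the residual between the stacked responses and the map applied to G. It assumes G is stored as a list of columns (one d-dimensional regression vector per experiment); the function names are illustrative and are not taken from the SVT reference.

```python
def vec(G_cols):
    """Stack the columns of G (one regression vector per experiment)."""
    return [v for col in G_cols for v in col]

def block_diag(blocks):
    """Block-diagonal matrix built from a list of N x d blocks (lists of rows)."""
    width = sum(len(B[0]) for B in blocks)
    rows, offset = [], 0
    for B in blocks:
        d = len(B[0])
        for r in B:
            rows.append([0.0] * offset + list(r) + [0.0] * (width - offset - d))
        offset += d
    return rows

def apply_map(blocks, G_cols):
    """Evaluate the linear map: block-diagonal matrix times vec(G)."""
    A, g = block_diag(blocks), vec(G_cols)
    return [sum(a * x for a, x in zip(row, g)) for row in A]

def residual_norm(blocks, G_cols, b):
    """Euclidean norm of b minus the map applied to G."""
    pred = apply_map(blocks, G_cols)
    return sum((bi - pi) ** 2 for bi, pi in zip(b, pred)) ** 0.5
```

Because the matrix is block diagonal, the i-th group of N entries of the output depends only on the i-th column of G, which is simply the vector of predictions for experiment i.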
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